Abstract

Sentiment analysis seeks to identify the view- 
point(s) underlying a text span; an example appli- 
cation is classifying a movie review as "thumbs up" 
or "thumbs down". To determine this sentiment po- 
larity, we propose a novel machine-learning method 
that applies text-categorization techniques to just 
the subjective portions of the document. Extracting 
these portions can be implemented using efficient 
techniques for finding minimum cuts in graphs; this 
greatly facilitates incorporation of cross-sentence 
contextual constraints. 

Publication info: Proceedings of the ACL, 2004. 
1 Introduction 

The computational treatment of opinion, sentiment, 
and subjectivity has recently attracted a great deal 
of attention (see references), in part because of its 
potential applications. For instance, information- 
extraction and question-answering systems could 
flag statements and queries regarding opinions 
rather than facts (Cardie et al., 2003). Also, it has
proven useful for companies, recommender sys- 
tems, and editorial sites to create summaries of peo- 
ple's experiences and opinions that consist of sub- 
jective expressions extracted from reviews (as is 
commonly done in movie ads) or even just a re- 
view's polarity — positive ("thumbs up") or neg- 
ative ("thumbs down"). 

Document polarity classification poses a sig- 
nificant challenge to data-driven methods, re- 
sisting traditional text-categorization techniques 
(Pang, Lee, and Vaithyanathan, 2002). Previous approaches
focused on selecting indicative lexical features (e.g., the
word "good"), classifying a document according to the
number of such features that
occur anywhere within it. In contrast, we propose 
the following process: (1) label the sentences in 
the document as either subjective or objective, discarding
the latter; and then (2) apply a standard
machine-learning classifier to the resulting extract. 
This can prevent the polarity classifier from consid- 
ering irrelevant or even potentially misleading text: 
for example, although the sentence "The protagonist 
tries to protect her good name" contains the word 
"good", it tells us nothing about the author's opin- 
ion and in fact could well be embedded in a negative 
movie review. Also, as mentioned above, subjectiv- 
ity extracts can be provided to users as a summary 
of the sentiment-oriented content of the document. 

Our results show that the subjectivity extracts 
we create accurately represent the sentiment in- 
formation of the originating documents in a much 
more compact form: depending on choice of down- 
stream polarity classifier, we can achieve highly sta- 
tistically significant improvement (from 82.8% to 
86.4%) or maintain the same level of performance 
for the polarity classification task while retaining 
only 60% of the reviews' words. Also, we ex- 
plore extraction methods based on a minimum cut 
formulation, which provides an efficient, intuitive, 
and effective means for integrating inter-sentence- 
level contextual information with traditional bag-of- 
words features. 

2 Method 

2.1 Architecture 

One can consider document-level polarity classi- 
fication to be just a special (more difficult) case 
of text categorization with sentiment- rather than 
topic-based categories. Hence, standard machine-learning
classification techniques, such as support vector machines
(SVMs), can be applied to the entire documents themselves, as
was done by Pang, Lee, and Vaithyanathan (2002). We refer to
such classification techniques as default polarity 
classifiers. 

However, as noted above, we may be able to improve polarity
classification by removing objective sentences (such as plot
summaries in a movie review). We therefore propose, as depicted
in Figure 1, to first employ a subjectivity detector that
determines whether each sentence is subjective or not:
discarding the objective ones creates an extract that 
should better represent a review's subjective content 
to a default polarity classifier. 

Figure 1: Polarity classification via subjectivity detection.
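
The two-stage architecture described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `toy_is_subjective` and `toy_polarity` are hypothetical stand-ins for a trained subjectivity detector and a default polarity classifier, and the fallback to the full review when nothing is kept is an assumption of this sketch.

```python
import re
from typing import Callable, List


def polarity_via_subjectivity(
    sentences: List[str],
    is_subjective: Callable[[str], bool],
    classify_polarity: Callable[[str], str],
) -> str:
    """Discard objective sentences, then classify the remaining extract."""
    extract = [s for s in sentences if is_subjective(s)]
    if not extract:
        # Assumption (not from the paper): fall back to the full review
        # when the detector keeps nothing.
        extract = list(sentences)
    return classify_polarity(" ".join(extract))


def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z']+", text.lower())


def toy_is_subjective(sentence: str) -> bool:
    # Hypothetical cue words standing in for a trained detector.
    return any(w in {"i", "we", "my"} for w in tokens(sentence))


def toy_polarity(text: str) -> str:
    # Hypothetical lexicon counter standing in for NB or an SVM.
    words = tokens(text)
    positive = sum(w in {"good", "great"} for w in words)
    negative = sum(w in {"boring", "awful"} for w in words)
    return "positive" if positive >= negative else "negative"


review = [
    "The protagonist tries to protect her good name.",
    "I found it boring.",
]
full_label = toy_polarity(" ".join(review))
extract_label = polarity_via_subjectivity(review, toy_is_subjective, toy_polarity)
```

In this toy case the objective plot sentence contributes the misleading word "good", so the two labels differ; with real classifiers the effect is of course statistical rather than guaranteed.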

To our knowledge, previous work has not 
integrated sentence-level subjectivity detec- 
tion with document-level sentiment polarity. 
Yu and Hatzivassiloglou (2003) provide methods
for sentence-level analysis and for determining 
whether a document is subjective or not, but do not 
combine these two types of algorithms or consider 
document polarity classification. The motivation 
behind the single-sentence selection method of
Beineke et al. (2004) is to reveal a document's
sentiment polarity, but they do not evaluate the 
polarity-classification accuracy that results. 

2.2 Context and Subjectivity Detection 

As with document-level polarity classification, we 
could perform subjectivity detection on individual 
sentences by applying a standard classification algo- 
rithm on each sentence in isolation. However, mod- 
eling proximity relationships between sentences 
would enable us to leverage coherence: text spans 
occurring near each other (within discourse bound- 
aries) may share the same subjectivity status, other 
things being equal (Wiebe, 1994).

We would therefore like to supply our algorithms 
with pairwise interaction information, e.g., to specify that
two particular sentences should ideally receive the same
subjectivity label but not state which label this should be.
Incorporating such information is somewhat unnatural for
classifiers whose input consists simply of individual feature
vectors, such as Naive Bayes or SVMs, precisely because such
classifiers label each test item in isolation. One could define
synthetic features or feature vectors to attempt to overcome
this obstacle.
However, we propose an alternative that avoids the 
need for such feature engineering: we use an efficient and
intuitive graph-based formulation relying on finding minimum
cuts. Our approach is inspired by Blum and Chawla (2001),
although they
focused on similarity between items (the motiva- 
tion being to combine labeled and unlabeled data), 
whereas we are concerned with physical proximity 
between the items to be classified; indeed, in com- 
puter vision, modeling proximity information via 
graph cuts has led to very effective classification 
(Boykov, Veksler, and Zabih, 1999).

2.3 Cut-based classification 

Figure 2 shows a worked example of the concepts
in this section. 

Suppose we have $n$ items $x_1, \ldots, x_n$ to divide
into two classes $C_1$ and $C_2$, and we have access to
two types of information:

• Individual scores $\mathit{ind}_j(x_i)$: non-negative estimates
of each $x_i$'s preference for being in $C_j$ based on just the
features of $x_i$ alone; and

• Association scores $\mathit{assoc}(x_i, x_k)$: non-negative
estimates of how important it is that $x_i$ and $x_k$ be in
the same class.^1

We would like to maximize each item's "net happiness": its
individual score for the class it is assigned to, minus its
individual score for the other class. But we also want to
penalize putting tightly associated items into different
classes. Thus, after some algebra, we arrive at the following
optimization problem: assign the $x_i$s to $C_1$ and $C_2$ so
as to minimize the partition cost

$$\sum_{x \in C_1} \mathit{ind}_2(x) + \sum_{x \in C_2} \mathit{ind}_1(x) + \sum_{x_i \in C_1,\, x_k \in C_2} \mathit{assoc}(x_i, x_k).$$

The problem appears intractable, since there are $2^n$ possible
binary partitions of the $x_i$'s. However, suppose we represent
the situation in the following manner. Build an undirected graph
$G$ with vertices $\{v_1, \ldots, v_n, s, t\}$; the last two are,
respectively, the source and sink. Add $n$ edges $(s, v_i)$, each
with weight $\mathit{ind}_1(x_i)$, and $n$ edges $(v_i, t)$, each
with weight $\mathit{ind}_2(x_i)$. Finally, add $\binom{n}{2}$
edges $(v_i, v_k)$, each with weight $\mathit{assoc}(x_i, x_k)$.
Then, cuts in $G$ are defined as follows:

Definition 1 A cut $(S, T)$ of $G$ is a partition of its
nodes into sets $S = \{s\} \cup S'$ and $T = \{t\} \cup T'$,
where $s \notin S'$, $t \notin T'$. Its cost $\mathit{cost}(S, T)$ is
the sum of the weights of all edges crossing from $S$ to $T$. A
minimum cut of $G$ is one of minimum cost.

^1 Asymmetry is allowed, but we used symmetric scores.

Figure 2: Graph for classifying three items. Brackets enclose example values; here, the individual scores happen to
be probabilities. Based on individual scores alone, we would put Y ("yes") in $C_1$, N ("no") in $C_2$, and be undecided
about M ("maybe"). But the association scores favor cuts that put Y and M in the same class, as shown in the table.
Thus, the minimum cut, indicated by the dashed line, places M together with Y in $C_1$.

Observe that every cut corresponds to a partition of 
the items and has cost equal to the partition cost. 
Thus, our optimization problem reduces to finding 
minimum cuts. 
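
The reduction to minimum cuts can be checked on a small instance. The sketch below is illustrative only: it uses a plain Edmonds-Karp max-flow (chosen here for brevity, not necessarily the solver the authors used), and the scores for the three items Y, M, and N are hypothetical values picked to agree with the qualitative description in the Figure 2 caption. A brute-force search over all partitions is included to confirm that the cut cost equals the minimum partition cost.

```python
from collections import deque
from itertools import combinations


def partition_cost(c1, ind1, ind2, assoc):
    """Cost of putting the items in c1 into C1 and all others into C2."""
    c1 = set(c1)
    cost = sum(ind2[x] for x in c1) + sum(ind1[x] for x in ind1 if x not in c1)
    cost += sum(w for (x, y), w in assoc.items() if (x in c1) != (y in c1))
    return cost


def brute_force_cost(ind1, ind2, assoc):
    """Minimum partition cost by enumerating all 2^n partitions."""
    items = list(ind1)
    return min(partition_cost(c, ind1, ind2, assoc)
               for r in range(len(items) + 1)
               for c in combinations(items, r))


def min_cut_classes(ind1, ind2, assoc):
    """Return (C1, C2, cost) from an s-t minimum cut of the graph G."""
    s, t = "_source", "_sink"
    cap = {u: {} for u in [s, t, *ind1]}

    def add_edge(u, v, w):
        # Undirected edge: capacity w in both directions.
        cap[u][v] = cap[u].get(v, 0.0) + w
        cap[v][u] = cap[v].get(u, 0.0) + w

    for x in ind1:
        add_edge(s, x, ind1[x])
        add_edge(x, t, ind2[x])
    for (x, y), w in assoc.items():
        add_edge(x, y, w)

    def bfs_parents():
        # Breadth-first search over edges with remaining capacity.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0.0
    parent = bfs_parents()
    while t in parent:
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        flow += push
        parent = bfs_parents()

    # Items still reachable from the source lie on the C1 side of the cut.
    c1 = {x for x in ind1 if x in parent}
    return c1, set(ind1) - c1, flow


# Hypothetical scores for three items (not taken from the text).
ind1 = {"Y": 0.8, "M": 0.5, "N": 0.1}
ind2 = {x: 1.0 - p for x, p in ind1.items()}
assoc = {("Y", "M"): 1.0, ("M", "N"): 0.2, ("Y", "N"): 0.1}
c1, c2, cost = min_cut_classes(ind1, ind2, assoc)
```

With these scores the undecided item M is pulled into the same class as Y by their strong association, which is the behaviour the caption describes.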

Practical advantages As we have noted, formulat- 
ing our subjectivity-detection problem in terms of 
graphs allows us to model item-specific and pair- 
wise information independently. Note that this is 
a very flexible paradigm. For instance, it is per- 
fectly legitimate to use knowledge-rich algorithms 
employing deep linguistic knowledge about sen- 
timent indicators to derive the individual scores. 
And we could also simultaneously use knowledge- 
lean methods to assign the association scores. In- 
terestingly, Yu and Hatzivassiloglou (2003) com- 
pared an individual-preference classifier against a 
relationship-based method, but didn't combine the 
two; the ability to coordinate such algorithms is 
precisely one of the strengths of our approach. 

But a crucial advantage specific to the uti- 
lization of a minimum-cut-based approach is 
that we can use maximum-flow algorithms with 
polynomial asymptotic running times — and 
near-linear running times in practice — to ex- 
actly compute the minimum-cost cut(s), despite 
the apparent intractability of the optimization 
problem (Cormen, Leiserson, and Rivest, 1990;
Ahuja, Magnanti, and Orlin, 1993).^2 In contrast, other
graph-partitioning problems that have been previously used to
formulate NLP classification problems^3 are NP-complete
(Hatzivassiloglou and McKeown, 1997; Agrawal et al., 2003;
Joachims, 2003).

^2 Code available at http://www.avglab.com/andrew/soft.html.

3 Evaluation Framework 

Our experiments involve classifying movie reviews as either
positive or negative, an appealing task for several reasons.
First, as mentioned in the introduction, providing polarity
information about reviews is a useful service: witness the
popularity of www.rottentomatoes.com. Second, movie reviews are
apparently harder to classify than reviews of other products
(Turney, 2002; Dave, Lawrence, and Pennock, 2003). Third, the
correct label can be extracted automatically from rating
information (e.g., number of stars). Our data^4 contains 1000
positive and 1000 negative reviews all written before 2002, with
a cap of 20 reviews per author (312 authors total) per category.
We refer to this corpus as the polarity dataset.

Default polarity classifiers We tested support vector machines
(SVMs) and Naive Bayes (NB). Following Pang et al. (2002), we
use unigram-presence features: the $i$th coordinate of a feature
vector is 1 if the corresponding unigram occurs in the input
text, 0 otherwise. (For SVMs, the feature vectors are
length-normalized.) Each default document-level polarity
classifier is trained and tested on the extracts formed by
applying one of the sentence-level subjectivity detectors to
reviews in the polarity dataset.
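
A unigram-presence vector of this kind can be sketched as follows. The tokenizer and the reading of "length-normalized" as scaling to unit Euclidean norm are assumptions of this sketch; the text does not specify either.

```python
import math
import re
from typing import List


def unigram_presence(text: str, vocabulary: List[str],
                     length_normalize: bool = False) -> List[float]:
    """Binary presence vector: 1 if the unigram occurs in text, else 0."""
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    present = set(re.findall(r"[a-z0-9']+", text.lower()))
    vec = [1.0 if word in present else 0.0 for word in vocabulary]
    if length_normalize:
        # Assumed meaning of "length-normalized": unit L2 norm.
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > 0:
            vec = [v / norm for v in vec]
    return vec


vocab = ["good", "bad", "movie"]
raw = unigram_presence("Good, good movie!", vocab)
normalized = unigram_presence("Good, good movie!", vocab, length_normalize=True)
```

Note that repeated occurrences of "good" do not change the vector: only presence is recorded, not frequency.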

^3 Graph-based approaches to general clustering problems
are too numerous to mention here.

^4 Available at www.cs.cornell.edu/people/pabo/movie-review-data/
(review corpus version 2.0).



Subjectivity dataset To train our detectors, we 
need a collection of labeled sentences. Riloff and 
Wiebe (2003) state that "It is [very hard] to obtain
collections of individual sentences that can be
easily identified as subjective or objective"; the 
polarity-dataset sentences, for example, have not 
been so annotated.^5 Fortunately, we were able
to mine the Web to create a large, automatically labeled
sentence corpus.^6 To gather subjective
sentences (or phrases), we collected 5000 movie- 
review snippets (e.g., "bold, imaginative, and im- 
possible to resist") from www.rottentomatoes.com. 
To obtain (mostly) objective data, we took 5000 sen- 
tences from plot summaries available from the In- 
ternet Movie Database (www.imdb.com). We only 
selected sentences or snippets at least ten words 
long and drawn from reviews or plot summaries of 
movies released post-2001, which prevents overlap 
with the polarity dataset. 

Subjectivity detectors As noted above, we can use our default
polarity classifiers as "basic" sentence-level subjectivity
detectors (after retraining on the subjectivity dataset) to
produce extracts of the original reviews. We also create a
family of cut-based subjectivity detectors; these take as input
the set of sentences appearing in a single document and
determine the subjectivity status of all the sentences
simultaneously using per-item and pairwise relationship
information. Specifically, for a given document, we use the
construction in Section 2.2 to build a graph wherein the source
$s$ and sink $t$ correspond to the class of subjective and
objective sentences, respectively, and each internal node $v_i$
corresponds to the document's $i$th sentence $s_i$. We can set
the individual scores $\mathit{ind}_1(s_i)$ to
$Pr_{sub}^{NB}(s_i)$ and $\mathit{ind}_2(s_i)$ to
$1 - Pr_{sub}^{NB}(s_i)$, as shown in Figure 3, where
$Pr_{sub}^{NB}(s)$ denotes Naive Bayes' estimate of the
probability that sentence $s$ is subjective; or, we can use the
weights produced by the SVM classifier instead.^7 If we set all
the association scores to zero, then the minimum-cut
classification of the sentences is the same as that of the basic
subjectivity detector. Alternatively, we incorporate the degree
of proximity between pairs of sentences, controlled by three
parameters. The threshold $T$ specifies the maximum distance two
sentences can be separated by and still be considered proximal.
The non-increasing function $f(d)$ specifies how the influence
of proximal sentences decays with respect to distance $d$; in
our experiments, we tried $f(d) = 1$, $e^{1-d}$, and $1/d^2$.
The constant $c$ controls the relative influence of the
association scores: a larger $c$ makes the minimum-cut algorithm
more loath to put proximal sentences in different classes. With
these in hand^8, we set (for $j > i$)

$$\mathit{assoc}(s_i, s_j) = \begin{cases} f(j - i) \cdot c & \text{if } (j - i) \le T; \\ 0 & \text{otherwise.} \end{cases}$$

^5 We therefore could not directly evaluate
sentence-classification accuracy on the polarity dataset.

^6 Available at www.cs.cornell.edu/people/pabo/movie-review-data/,
sentence corpus version 1.0.

^7 We converted SVM output $d_i$, which is a signed distance
(negative = objective) from the separating hyperplane, to
non-negative numbers by

$$\mathit{ind}_1(s_i) = \begin{cases} 1 & d_i > 2; \\ (2 + d_i)/4 & -2 \le d_i \le 2; \\ 0 & d_i < -2. \end{cases}$$

and $\mathit{ind}_2(s_i) = 1 - \mathit{ind}_1(s_i)$. Note that scaling is employed
only for consistency; the algorithm itself does not require
probabilities for individual scores.
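
The edge weights for a document's sentence graph can be sketched as below. This is a hedged illustration rather than the authors' implementation: it assumes the association score is $f(j - i) \cdot c$ for sentences at most $T$ apart and zero otherwise, that the exponential decay is $e^{1-d}$, and that SVM distances are clipped linearly to $[0, 1]$ over the range $[-2, 2]$. The function and dictionary names are hypothetical.

```python
import math
from typing import Callable, Dict, Tuple


def svm_distance_to_ind1(d: float) -> float:
    """Map a signed SVM distance to a subjectivity score in [0, 1]."""
    if d > 2:
        return 1.0
    if d < -2:
        return 0.0
    return (2 + d) / 4


def proximity_assoc(num_sentences: int, T: int,
                    f: Callable[[int], float],
                    c: float) -> Dict[Tuple[int, int], float]:
    """Association scores for sentence pairs (i, j) with 0 < j - i <= T.

    Pairs farther apart than T are omitted, which is equivalent to
    giving them an association score of zero.
    """
    assoc = {}
    for i in range(num_sentences):
        for j in range(i + 1, min(i + T, num_sentences - 1) + 1):
            assoc[(i, j)] = f(j - i) * c
    return assoc


# The three decay functions mentioned in the text (names are hypothetical).
DECAYS = {
    "constant": lambda d: 1.0,
    "exponential": lambda d: math.exp(1 - d),
    "inverse_square": lambda d: 1.0 / d ** 2,
}

example_assoc = proximity_assoc(4, T=2, f=DECAYS["inverse_square"], c=0.5)
```

All three decay functions equal 1 at distance 1, so adjacent sentences always receive association score $c$; the choice of $f$ only changes how quickly the influence fades for $d > 1$. Setting `c=0` reproduces the context-blind detector, since every association edge then has zero weight.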

4 Experimental Results 

Below, we report average accuracies computed by 
ten-fold cross-validation over the polarity dataset. 
Section 4.1 examines our basic subjectivity extraction
algorithms, which are based on individual-sentence
predictions alone. Section 4.2 evaluates
the more sophisticated form of subjectivity extrac- 
tion that incorporates context information via the 
minimum-cut paradigm. 

As we will see, the use of subjectivity extracts 
can in the best case provide satisfying improve- 
ment in polarity classification, and otherwise can 
at least yield polarity-classification accuracies indis- 
tinguishable from employing the full review. At the 
same time, the extracts we create are both smaller 
on average than the original document and more 
effective as input to a default polarity classifier 
than the same-length counterparts produced by stan- 
dard summarization tactics (e.g., first- or last-N sen- 
tences). We therefore conclude that subjectivity ex- 
traction produces effective summaries of document 
sentiment. 

4.1 Basic subjectivity extraction 

As noted in Section 3, both Naive Bayes and SVMs
can be trained on our subjectivity dataset and then 
used as a basic subjectivity detector. The former has 
somewhat better average ten-fold cross-validation 
performance on the subjectivity dataset (92% vs. 
90%), and so for space reasons, our initial discus- 
sions will focus on the results attained via NB sub- 
jectivity detection. 

^8 Parameter training is driven by optimizing the performance
of the downstream polarity classifier rather than the detector
itself because the subjectivity dataset's sentences come from
different reviews, and so are never proximal.

Figure 3: Graph-cut-based creation of subjective extracts.



Employing Naive Bayes as a subjectivity detector (Extract_NB)
in conjunction with a Naive Bayes document-level polarity
classifier achieves 86.4% accuracy.^9 This is a clear
improvement over the 82.8% that results when no extraction is
applied (Full review); indeed, the difference is highly
statistically significant (p < 0.01, paired t-test). With SVMs
as the polarity classifier instead, the Full review performance
rises to 87.15%, but comparison via the paired t-test reveals
that this is statistically indistinguishable from the 86.4% that
is achieved by running the SVM polarity classifier on Extract_NB
input. (More improvements to extraction performance are reported
later in this section.)
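
For readers unfamiliar with the test, the paired t statistic can be computed as sketched below. This assumes the pairing is over matched evaluation runs (for instance, the same cross-validation folds under two conditions), which the text does not spell out; the inputs in the example are arbitrary placeholder numbers, not results from the paper.

```python
import math
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples a and b.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n), n - 1


# Placeholder inputs chosen only to make the arithmetic easy to follow.
t_value, dof = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

The statistic is then compared against a Student t distribution with `dof` degrees of freedom to obtain a p-value.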

These findings indicate that the extracts pre- 
serve (and, in the NB polarity-classifier case, appar- 
ently clarify) the sentiment information in the orig- 
inating documents, and thus are good summaries 
from the polarity-classification point of view. Fur- 
ther support comes from a "flipping" experiment: 
if we give as input to the default polarity classifier 
an extract consisting of the sentences labeled ob- 
jective, accuracy drops dramatically to 71% for NB 
and 67% for SVMs. This confirms our hypothesis 
that sentences discarded by the subjectivity extrac- 
tion process are indeed much less indicative of sen- 
timent polarity. 

Moreover, the subjectivity extracts are much 
more compact than the original documents (an im- 
portant feature for a summary to have): they contain 
on average only about 60% of the source reviews' 
words. (This word preservation rate is plotted along 
the x-axis in the graphs in Figure 5.) This prompts
us to study how much reduction of the original documents
subjectivity detectors can perform and still
accurately represent the texts' sentiment information.

We can create subjectivity extracts of varying lengths by
taking just the N most subjective sentences^11 from the
originating review. As one baseline to compare against, we take
the canonical summarization standard of extracting the first N
sentences: in general settings, authors often begin documents
with an overview. We also consider the last N sentences: in many
documents, concluding material may be a good summary, and
www.rottentomatoes.com tends to select "snippets" from the end
of movie reviews (Beineke et al., 2004). Finally, as a sanity
check, we include results from the least subjective sentences
according to Naive Bayes.
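
The four N-sentence extraction strategies compared here can be sketched as follows. The per-sentence subjectivity probabilities are assumed to come from some already-trained detector; keeping the selected sentences in their original order is a choice of this sketch (it makes no difference to bag-of-words classifiers).

```python
from typing import List, Sequence


def _select(sentences: Sequence[str], scores: Sequence[float],
            n: int, highest: bool) -> List[str]:
    if len(sentences) <= n:
        # Reviews with no more than N sentences are returned whole.
        return list(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=highest)
    keep = sorted(ranked[:n])  # restore document order
    return [sentences[i] for i in keep]


def most_subjective(sentences, subj_prob, n):
    """The N sentences with the highest subjectivity probability."""
    return _select(sentences, subj_prob, n, highest=True)


def least_subjective(sentences, subj_prob, n):
    """The N sentences with the lowest subjectivity probability."""
    return _select(sentences, subj_prob, n, highest=False)


def first_n(sentences, n):
    return list(sentences[:n])


def last_n(sentences, n):
    return list(sentences[-n:]) if n > 0 else []


doc = ["a", "b", "c", "d"]
probs = [0.9, 0.2, 0.7, 0.4]  # hypothetical detector outputs
top2 = most_subjective(doc, probs, 2)
bottom2 = least_subjective(doc, probs, 2)
```

Selection is by rank only, so a sentence can be included even if its probability is below 0.5.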

Figure 4 shows the polarity classifier results as
N ranges between 1 and 40. Our first observation 
is that the NB detector provides very good "bang 
for the buck": with subjectivity extracts containing 
as few as 15 sentences, accuracy is quite close to 
what one gets if the entire review is used. In fact, 
for the NB polarity classifier, just using the 5 most 
subjective sentences is almost as informative as the 
Full review while containing on average only about 
22% of the source reviews' words. 

Also, it so happens that at N = 30, performance is actually
slightly better than (but statistically indistinguishable from)
Full review even when the SVM default polarity classifier is
used (87.2% vs. 87.15%).^12 This suggests potentially effective
extraction alternatives other than using a fixed probability
threshold (which resulted in the lower accuracy of 86.4%
reported above).

^9 This result and others are depicted in Figure 5; for now,
consider only the y-axis in those plots.

^10 Recall that direct evidence is not available because the
polarity dataset's sentences lack subjectivity labels.

^11 These are the sentences assigned the highest probability
by the basic NB detector, regardless of whether their
probabilities exceed 50% and so would actually be classified as
subjective by Naive Bayes. For reviews with fewer than N
sentences, the entire review will be returned.

^12 Note that roughly half of the documents in the polarity
dataset contain more than 30 sentences (average=32.3, standard
deviation 15).

Figure 4: Accuracies using N-sentence extracts for NB (left) and SVM (right) default polarity classifiers.

Figure 5: Word preservation rate vs. accuracy, NB (left) and SVMs (right) as default polarity classifiers.
Also indicated are results for some statistical significance tests.

Furthermore, we see in Figure 4 that the
most-subjective-sentences method generally outperforms the other
baseline summarization methods (which perhaps suggests that
sentiment summarization cannot be treated the same as
topic-based summarization, although this conjecture would need
to be verified on other domains and data). It's also interesting
to observe how much better the last N sentences are than the
first N sentences; this may reflect a (hardly surprising)
tendency for movie-review authors to place plot descriptions at
the beginning rather than the end of the text and conclude with
overtly opinionated statements.

4.2 Incorporating context information 

The previous section demonstrated the value of 
subjectivity detection. We now examine whether context
information, particularly regarding sentence proximity, can
further improve subjectivity extraction. As discussed in
Sections 2.2 and 3, contextual constraints are easily
incorporated via the minimum-cut formalism but are not natural
inputs for standard Naive Bayes and SVMs.

Figure 5 shows the effect of adding in proximity information.
Extract_NB+Prox and Extract_SVM+Prox are the graph-based
subjectivity detectors using Naive Bayes and SVMs, respectively,
for the individual scores; we depict the best performance
achieved by a single setting of the three proximity-related
edge-weight parameters over all ten data folds^13 (parameter
selection was not a focus of the current work). The two
comparisons we are most interested in are Extract_NB+Prox versus
Extract_NB and Extract_SVM+Prox versus Extract_SVM.

^13 Parameters are chosen from $T \in \{1, 2, 3\}$,
$f(d) \in \{1, e^{1-d}, 1/d^2\}$, and $c \in [0, 1]$ at intervals of 0.1.

We see that the context-aware graph-based sub- 
jectivity detectors tend to create extracts that are 
more informative (statistically significantly so (paired
t-test) for SVM subjectivity detectors only), al- 
though these extracts are longer than their context- 
blind counterparts. We note that the performance 
enhancements cannot be attributed entirely to the 
mere inclusion of more sentences regardless of 
whether they are subjective or not — one counter- 
argument is that Full review yielded substantially 
worse results for the NB default polarity classifier — 
and at any rate, the graph-derived extracts are still 
substantially more concise than the full texts. 

Now, while incorporating a bias for assigning 
nearby sentences to the same category into NB and 
SVM subjectivity detectors seems to require some 
non-obvious feature engineering, we also wish 
to investigate whether our graph-based paradigm 
makes better use of contextual constraints that can 
be (more or less) easily encoded into the input of 
standard classifiers. For illustrative purposes, we 
consider paragraph-boundary information, looking 
only at SVM subjectivity detection for simplicity's 
sake. 

It seems intuitively plausible that paragraph 
boundaries (an approximation to discourse bound- 
aries) loosen coherence constraints between nearby 
sentences. To capture this notion for minimum-cut- 
based classification, we can simply reduce the as- 
sociation scores for all pairs of sentences that oc- 
cur in different paragraphs by multiplying them by 
a cross-paragraph-boundary weight $w \in [0, 1]$. For
standard classifiers, we can employ the trick of hav- 
ing the detector treat paragraphs, rather than sen- 
tences, as the basic unit to be labeled. This en- 
ables the standard classifier to utilize coherence be- 
tween sentences in the same paragraph; on the other 
hand, it also (probably unavoidably) poses a hard 
constraint that all of a paragraph's sentences get the 
same label, which increases noise sensitivity. Our 
experiments reveal the graph-cut formulation to be 
the better approach: for both default polarity clas- 
sifiers (NB and SVM), some choice of parameters 
(including w) for Extract_SVM+Prox yields statistically
significant improvement over its paragraph-unit
non-graph counterpart (NB: 86.4% vs. 85.2%;
SVM: 86.15% vs. 85.45%). 
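
The paragraph-boundary adjustment for the minimum-cut detector amounts to rescaling some association scores. A minimal sketch, assuming association scores are stored per sentence pair and that a hypothetical `paragraph_of` list gives each sentence's paragraph index:

```python
from typing import Dict, Sequence, Tuple


def apply_paragraph_weight(assoc: Dict[Tuple[int, int], float],
                           paragraph_of: Sequence[int],
                           w: float) -> Dict[Tuple[int, int], float]:
    """Multiply association scores by w for pairs that span a paragraph boundary."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return {
        (i, j): score * (w if paragraph_of[i] != paragraph_of[j] else 1.0)
        for (i, j), score in assoc.items()
    }


# Three sentences: the first two share a paragraph, the third starts a new one.
base_assoc = {(0, 1): 0.5, (1, 2): 0.5}
weighted = apply_paragraph_weight(base_assoc, paragraph_of=[0, 0, 1], w=0.2)
```

With `w = 1` paragraph boundaries are ignored, and with `w = 0` sentences in different paragraphs are labeled without regard to each other; intermediate values give the soft constraint that the paragraph-as-unit alternative cannot express.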

^14 For example, in the data we used, boundaries may have
been missed due to malformed html.

5 Conclusions

We examined the relation between subjectivity detection and
polarity classification, showing that subjectivity detection
can compress reviews into much
shorter extracts that still retain polarity information 
at a level comparable to that of the full review. In 
fact, for the Naive Bayes polarity classifier, the sub- 
jectivity extracts are shown to be more effective in- 
put than the originating document, which suggests 
that they are not only shorter, but also "cleaner" rep- 
resentations of the intended polarity. 

We have also shown that employing the 
minimum-cut framework results in the develop- 
ment of efficient algorithms for sentiment analy- 
sis. Utilizing contextual information via this frame- 
work can lead to statistically significant improve- 
ment in polarity-classification accuracy. Directions 
for future research include developing parameter- 
selection techniques, incorporating other sources of 
contextual cues besides sentence proximity, and in- 
vestigating other means for modeling such informa- 
tion. 